Modern predictive modeling tools, such as random forests and related ensemble methods, have become almost ubiquitous in research applications that combine survey methodology and data science. However, an important potential flaw in the widespread application of these methods has not received sufficient research attention to date. Researchers at the intersection of computer science and survey science frequently leverage linked data sets to study relationships between variables, where the techniques used to link two (or more) data sets may be probabilistic rather than deterministic. If mismatch errors occur frequently when linking data sets, the commonly desired outputs of predictive modeling tools describing relationships between variables in the linked data (e.g., variable importance, confusion matrices, RMSE) may be negatively affected, and the true predictive performance of these tools may not be realized. We demonstrate a new methodology based on mixture modeling that is designed to adjust modern predictive modeling tools for the presence of mismatch errors in a linked data set. We evaluate the performance of this new methodology in an application that uses observed Twitter/X activity measures and predicted socio-demographic features of Twitter/X users to predict linked measures of political ideology collected in a designed survey, in which respondents were asked for consent to link their Twitter/X activity data to their survey responses (exactly, based on Twitter/X handles). We find that the new methodology, which we have implemented in R, is able to largely recover results that would have been seen prior to the introduction of mismatch errors in the linked data set.
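The paper's R implementation is not reproduced in the abstract above, so the following is only a minimal Python sketch of the general idea: treat each linked (x, y) pair as a two-component mixture in which, with probability alpha, the pair is a correct match and y is centered at the ensemble prediction, and otherwise y is an unrelated draw from its marginal distribution. The function name mismatch_adjusted_forest, the Gaussian residual model, and the EM-style reweighting scheme are our own simplifying assumptions, not the authors' methodology.

import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def mismatch_adjusted_forest(X, y, alpha_init=0.9, n_iter=10, seed=0):
    """EM-style reweighting of a random forest to down-weight likely mismatches.

    Illustrative mixture assumption: a correctly linked record has
    y_i ~ N(f(x_i), sigma^2); a mismatched record has y_i drawn from the
    marginal distribution of y, independent of x_i.
    """
    y = np.asarray(y, dtype=float)
    alpha = alpha_init                        # overall probability of a correct match
    mu_y, sd_y = y.mean(), y.std() + 1e-8     # marginal component for mismatches
    w = np.full(len(y), alpha)                # responsibility: P(record i is correctly matched)
    rf = RandomForestRegressor(n_estimators=200, random_state=seed)
    for _ in range(n_iter):
        rf.fit(X, y, sample_weight=w)         # weighted refit of the forest
        pred = rf.predict(X)                  # in-sample predictions; out-of-bag would be safer
        sigma = np.sqrt(np.average((y - pred) ** 2, weights=w)) + 1e-8
        # E-step: posterior probability that each linked pair is a correct match
        lik_match = norm.pdf(y, loc=pred, scale=sigma)
        lik_mismatch = norm.pdf(y, loc=mu_y, scale=sd_y)
        w = alpha * lik_match / (alpha * lik_match + (1 - alpha) * lik_mismatch + 1e-12)
        alpha = w.mean()                      # updated match rate
    return rf, w, alpha

The returned weights could then be used, for example, to recompute variable importance or error metrics while discounting records that are unlikely to be correctly matched.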
-
Abstract: The advent of the information age has revolutionized data collection and has led to a rapid expansion of available data sources. Methods of data integration are indispensable when a question of interest cannot be addressed using a single data source. Record linkage (RL) is at the forefront of such data integration efforts. Incentives for sharing linked data for secondary analysis have prompted the need for methodology accounting for possible errors at the RL stage. Mismatch error is a common consequence of using nonunique or noisy identifiers at that stage. In this paper, we present a framework that enables valid post-linkage inference in the secondary analysis setting, in which only the linked file is given. The proposed framework covers a variety of statistical models and can flexibly incorporate information about the underlying RL process. We propose a mixture model for linked records whose two components reflect distributions conditional on match status, i.e., correct or false match. For inference, we develop a method based on composite likelihood and the expectation-maximization (EM) algorithm, implemented in the R package pldamixture. Extensive simulations and case studies involving contemporary RL applications corroborate the effectiveness of our framework.
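As a reference point, one way to write down the two-component mixture and composite likelihood described in this abstract is given below; the notation (p_i, f_theta, f^*) is ours, and the paper's exact formulation and the pldamixture interface may differ.

f(y_i \mid x_i) \;=\; p_i \, f_\theta(y_i \mid x_i) \;+\; (1 - p_i)\, f^{*}(y_i),
\qquad
\ell_c(\theta) \;=\; \sum_{i=1}^{n} \log f(y_i \mid x_i)

Here p_i is the probability that linked record i is a correct match (possibly informed by the underlying RL process), f_theta is the analysis model of interest, f^* is the distribution of the response under a false match, and the composite log-likelihood is maximized over theta with an EM algorithm that treats the unknown match status as missing data.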
-
In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption is that each pair corresponds to the same observational unit. If, however, such pairs are obtained via record linkage of two files, this assumption can be violated as a result of mismatch error rooted, for example, in the lack of reliable identifiers in the two files. Recently, there has been a surge of interest in this setting under the term “shuffled data,” in which the underlying correct pairing of (X, Y)-pairs is represented by an unknown permutation. Explicit modeling of the permutation tends to be associated with overfitting, prompting the need for suitable methods of regularization. In this paper, we propose an exponential family prior on the permutation group for this purpose that can be used to integrate various structures such as sparse and local shuffling. This prior turns out to be conjugate for canonical shuffled-data problems in which the likelihood, conditional on a fixed permutation, can be expressed as a product over the corresponding (X, Y)-pairs. Inference can be based on the EM algorithm, in which the E-step is approximated by sampling, e.g., via the Fisher-Yates algorithm. The M-step is shown to admit a reduction from n^2 to n terms when the likelihood of (X, Y)-pairs has exponential family form. Comparisons on synthetic and real data show that the proposed approach compares favorably to competing methods.
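Because the abstract only outlines the algorithm, the toy sketch below illustrates the two computational points it mentions under our own simplifying assumptions: a Gaussian linear model, a uniform prior over permutations, and a Metropolis sampler with transposition proposals standing in for the paper's Fisher-Yates-based sampling in the E-step. The n^2-to-n reduction in the M-step then amounts to refitting the model to the averaged responses W y, where W approximates the expected permutation matrix.

import numpy as np

def shuffled_regression_mcem(X, y, n_iter=20, n_samples=200, n_sweeps=50, seed=0):
    """Toy Monte Carlo EM for linear regression with shuffled responses.

    E-step: approximate W = E[P | X, y, beta] by Metropolis sampling over
    permutations (transposition proposals; uniform prior assumed here).
    M-step: with a Gaussian likelihood the expected complete-data
    log-likelihood depends on y only through W @ y and W @ y**2, so the
    n^2 pairwise terms collapse to n terms.
    """
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from the naive pairing
    sigma2 = float(np.var(y - X @ beta)) + 1e-8
    perm = np.arange(n)                           # current permutation state
    for _ in range(n_iter):
        # E-step: accumulate an estimate of the expected permutation matrix
        W = np.zeros((n, n))
        mu = X @ beta
        for _ in range(n_samples):
            for _ in range(n_sweeps):
                i, j = rng.integers(n, size=2)
                cur = (y[perm[i]] - mu[i]) ** 2 + (y[perm[j]] - mu[j]) ** 2
                new = (y[perm[j]] - mu[i]) ** 2 + (y[perm[i]] - mu[j]) ** 2
                if np.log(rng.random()) < (cur - new) / (2.0 * sigma2):
                    perm[i], perm[j] = perm[j], perm[i]
            W[np.arange(n), perm] += 1.0 / n_samples
        # M-step: n-term update using the averaged responses
        y_tilde = W @ y
        beta = np.linalg.lstsq(X, y_tilde, rcond=None)[0]
        fitted = X @ beta
        sigma2 = float(np.mean(W @ (y ** 2) - 2 * y_tilde * fitted + fitted ** 2)) + 1e-8
    return beta, W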